Abstract. While previous work on energy-efficient algorithms assumed that
tasks can be assigned to any processor, we initiate the study of task
scheduling on restricted parallel processors. The objective is to minimize the
overall energy consumption, using speed scaling (SS) to reduce energy
consumption under an execution time constraint (makespan $C_{max}$). In this
work, we discuss speed setting in the continuous model, in which processors can
run at an arbitrary speed in $[s_{min}, s_{max}]$. The energy-efficient
scheduling problem, involving both task assignment and speed scaling, is
inherently complicated, as it is proved to be NP-Complete. We formulate the
problem as an Integer Programming (IP) problem. We devise a polynomial time
optimal scheduling algorithm for the case where the tasks have a uniform size.
Our algorithm runs in $O(mn^3 \log n)$ time, where $m$ is the number of
processors and $n$ is the number of tasks. We then present a polynomial time
algorithm that achieves an approximation factor of
$2^{\alpha-1}(2 - \frac{1}{m^{\alpha}})$ ($\alpha$ is the power parameter) when
the tasks have arbitrary size work. Experimental results demonstrate that our
algorithm provides an efficient schedule for the problem of task scheduling on
restricted parallel processors.



1 Introduction 

Energy consumption has become an important issue in parallel processor
computational systems. Dynamic Speed Scaling (SS) is a popular approach to
energy-efficient scheduling that significantly reduces energy consumption by
dynamically changing the speeds of the processors. The well-known relationship
between speed and power is the cube-root rule; more precisely, the power of a
processor is proportional to $s^3$ when it runs at speed $s$ [1, 2]. Most of
the research literature [3, 4, 5, 6, 7, 8, 9, 10] has assumed a more general
power function $s^{\alpha}$, where $\alpha > 1$ is a constant power parameter.
Note that it is a convex function of the processor's speed. Energy consumption
is the power integrated over the duration of execution. Higher speeds allow for
faster execution but, at the same time, result in higher energy consumption.

In the past few years, energy-efficient scheduling has received much
attention, from single processor to parallel processor environments. In the
algorithmic community, the approaches to reducing energy usage can (in general)
be categorized into the following two classes [5, 7]. (1) Dynamic speed
scaling: The processors lower their speed as much as possible while still
fulfilling the timing constraints of the tasks. The reason behind energy saving
via this strategy is the convexity of the power function. The goal is to decide
the processing speeds in a way that minimizes the total energy consumption and
guarantees the prescribed deadline. (2) Power-down management: The processors
are put into a power-saving state when they are idle. However, transitioning
back to the active state costs energy. This strategy determines whether there
exist idle periods that outweigh the transition cost and decides when to wake
up from the power-saving mode in order to complete all tasks in time. Our paper
focuses on energy-efficient scheduling via the dynamic speed scaling strategy.
In this policy, the goal of scheduling is either to minimize the total energy
consumption or to trade off the conflicting objectives of energy and
performance. The main difference is that the former reduces the total energy
consumption as long as the timing constraint is not violated, while the latter
seeks the best point between the energy cost and a performance metric (such as
makespan or flow time).

Speed scaling as a means of saving energy has been widely studied since the
work initiated by Yao et al. [3]. The previous work considers that a task can
be assigned to any processor. But it is natural to consider restricted
scheduling in modern computational systems. The reason is that systems, such as
clusters, evolve over time, so their processors differ from one another (for
instance, the processors have different additional components). As a result, a
task can only be assigned to the processors that have the task's required
component. In practice, certain tasks may have to be allocated to certain
physical resources (such as GPUs) [11], that is, the tasks must be assigned to
some specific processors. It is also pointed out that the design of some
processors is specialized for particular types of tasks, so tasks should be
assigned to the processor best suited for them [12]. Furthermore, when
considering tasks and input data, tasks need to be assigned to the processors
containing their input data. In other words, some tasks can be assigned to the
processor set $A_i$ and some tasks can be assigned to the processor set $A_j$,
where $A_i \neq A_j$ and $A_i \cap A_j \neq \emptyset$. Another case in point is
that scheduling with processor restrictions aimed at minimizing the makespan
has been studied extensively in the algorithmic community (see [13] for an
excellent survey). Therefore, it is significant to study scheduling with
processor restrictions from both practical and algorithmic perspectives.

Previous Work: Yao et al. [3] were the first to explore the problem of
scheduling a set of tasks with the smallest amount of energy in a single
processor environment via speed scaling. They proposed an optimal offline
greedy algorithm and two bounded online algorithms named Optimal Available and
Average Rate. Ishihara et al. [4] formulated energy minimization under dynamic
voltage scheduling (DVS) as an integer linear programming problem in which all
tasks are ready at the beginning and share a common finishing time. They showed
that, when the processor can use only a small number of discrete speeds, it
runs at only two adjacent discrete speeds in the optimal solution.

Besides studying variants of the speed scaling problem on a single processor,
researchers have also carried out studies in parallel processor environments.
Chen et al. [6] considered energy-efficient scheduling with and without task
migration on multiprocessors. They proposed approximation algorithms for
different settings of power characteristics where no task is allowed to
migrate. When task migration is allowed and the migration cost is assumed to be
negligible, they showed that there is an optimal real-time task scheduling
algorithm. Albers et al. [7] investigated the basic problem of scheduling a set
of tasks on multiple processors with the aim of minimizing the total energy
consumption. First they studied the case where all tasks have unit size and
proposed a polynomial time algorithm for agreeable deadlines. They proved that
the problem is NP-Hard for arbitrary release times and deadlines and gave an
$\alpha^{\alpha} 2^{4\alpha}$-approximation algorithm. For scheduling tasks with
arbitrary processing size, they developed constant factor approximation
algorithms. Aupy et al. [2] studied the minimization of energy on a set of
processors for which the task assignment is given. They investigated different
speed scaling models. Angel et al. [10] considered the multiprocessor migratory
and preemptive scheduling problem with the objective of minimizing the energy
consumption. They proposed an optimal algorithm for the case where the jobs
have release dates and deadlines and the power parameter satisfies
$\alpha > 2$.

Some work has also studied performance under an energy budget. Pruhs et al.
[8] discussed the problem of speed scaling to optimize the makespan under an
energy budget in a multiprocessor environment where the tasks have precedence
constraints ($Pm \mid prec, energy \mid C_{max}$, where $m$ is the number of
processors). They reduced the problem to $Qm \mid prec \mid C_{max}$ and
obtained a poly-log($m$)-approximation algorithm, assuming processors can
change speed continuously over time. Greiner et al. [9] studied the trade-off
between energy and delay, i.e., their objective was to minimize the sum of
energy cost and delay cost. They suggested a randomized algorithm
$\mathcal{RA}$ for multiple processors: each task is assigned uniformly at
random to the processors, and then a single processor algorithm $\mathcal{A}$
is applied separately to each processor. They proved that the approximation
factor of $\mathcal{RA}$ is $\beta B_{\alpha}$ without task migration when
$\mathcal{A}$ is a $\beta$-approximation algorithm ($B_{\alpha}$ is the
$\alpha$-th Bell number). They also showed that any $\beta$-competitive online
algorithm for a single processor yields a randomized
$\beta B_{\alpha}$-competitive online algorithm for multiple processors without
migration. Using the method of conditional expectations, the results can be
transformed into a derandomized version with additional running time. Angel et
al. [10] also extended their energy minimization algorithm to obtain an optimal
algorithm for the problem of maximum lateness minimization under an energy
budget.

However, all of these results were established without taking into account
restricted parallel processors. More formally, let the set of tasks
$\mathcal{J}$ and the set of processors $\mathcal{P}$ form a bipartite graph
$G = (\mathcal{J} \cup \mathcal{P}, E)$, where an edge in $E$ denotes that a
task can be assigned to a processor. The previous work studies the case where
$G$ is a complete bipartite graph, i.e., for any two vertices
$v_1 \in \mathcal{J}$ and $v_2 \in \mathcal{P}$, the edge $v_1 v_2$ is in $G$.
We study energy-efficient scheduling where $G$ is a general bipartite graph,
i.e., $v_1 v_2$ may not be an edge of $G$.

Our contribution: In this paper, we address the problem of task Scheduling
with the objective of Energy Minimization on Restricted Parallel Processors
(SEMRPP). We assume that all tasks are ready at time 0 and share a common
deadline (a real-time constraint) [2, 4, 6, 7]. In this work, we discuss the
continuous speed setting, in which processors can run at an arbitrary speed in
$[s_{min}, s_{max}]$. We propose an optimal scheduling algorithm for the case
where all the tasks have uniform computational work. For the general case where
the tasks have non-uniform computational work, we prove that the minimization
of energy is NP-Complete in the strong sense. We give a
$2^{\alpha-1}(2 - \frac{1}{m^{\alpha}})$-approximation algorithm, where $\alpha$
is the power parameter and $m$ is the number of processors. After the
algorithm analysis, the performance of the approximation algorithm is evaluated
through a set of experiments, and the results confirm that the proposed
scheduling works efficiently. To the best of our knowledge, our work may be the
first attempt to study energy optimization on restricted parallel processors.

The remainder of this paper is organized as follows. We provide the formal
description of the model in Section 2. Section 3 discusses some preliminary
results and formulates the problem as an Integer Programming (IP) problem. In
Section 4, we devise a polynomial time optimal scheduling algorithm for the
case where the tasks have uniform size. In Section 5, we present an
approximation algorithm with a bounded factor for the general case where the
tasks have arbitrary size work. Section 6 shows the experimental results.
Finally, we conclude the paper in Section 7.

2 Problem and Model 

We model the SEMRPP problem of scheduling a set
$\mathcal{J} = \{J_1, J_2, \ldots, J_n\}$ of $n$ independent tasks on a set
$\mathcal{P} = \{P_1, P_2, \ldots, P_m\}$ of $m$ processors. Each task $J_j$
has an amount of computational work $w_j$, which is defined as the number of
CPU cycles required for the execution of $J_j$ [3]. We refer to the set
$\mathcal{M}_j \subseteq \mathcal{P}$ as the eligibility processing set of the
task $J_j$, that is, $J_j$ needs to be scheduled on one of its eligible
processors in $\mathcal{M}_j$ ($\mathcal{M}_j \neq \emptyset$). We also say
that $J_j$ is allowable on processor $P_i \in \mathcal{M}_j$, and it is not
allowed to migrate after it is assigned to a processor. A processor can process
at most one task at a time and all processors are available at time 0.

At any time $t$, the speed of $J_j$ is denoted as $s_{jt}$, and the
corresponding processing power is $P_{jt} = (s_{jt})^{\alpha}$. The amount of
CPU cycles $w_j$ executed in a time interval is the speed integrated over the
duration, and the energy consumption $E_j$ is the power integrated over the
duration, that is, $w_j = \int s_{jt}\,dt$ and $E_j = \int P_{jt}\,dt$,
following the classical models of the literature [2, 3, 4, 5, 6, 7, 8, 9, 10].
Note that in this work we focus on speed scaling and all processors are alive
during the whole execution, so we do not take static energy into account
[2, 8]. Let $C_j$ be the time when the task $J_j$ finishes its execution. Let
$x_{ij}$ be a 0-1 variable which is equal to one if the task $J_j$ is processed
on processor $P_i$ and zero otherwise. We note that $x_{ij} = 0$ if
$P_i \notin \mathcal{M}_j$. Our goal is to schedule the tasks on the processors
so as to minimize the overall energy consumption, such that each task finishes
before the given common deadline $C$ and is processed on one of its eligible
processors. The SEMRPP problem is then formulated as below:

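$$
\begin{aligned}
(P_0) \quad \min \ & \sum_{j=1}^{n} \int P_{jt}\,dt \\
\text{s.t.} \ & C_j \le C, && \forall J_j, \\
& \sum_{i=1}^{m} x_{ij} = 1, && \forall J_j, \\
& x_{ij}(1 - x_{ij}) = 0, && \forall J_j, P_i \in \mathcal{M}_j.
\end{aligned}
$$

3 Preliminary Lemma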

We start by giving preliminary lemmas for reformulating the SEMRPP problem. 

Lemma 1. If S is an optimal schedule for the SEMRPP problem in the continuous
model, it is optimal to execute each task at a unique speed throughout its
execution.

Proof. Suppose S is an optimal schedule in which some task $J_j$ does not run
at a unique speed during its execution. We denote $J_j$'s speeds by
$s_{j1}, s_{j2}, \ldots, s_{jk}$; the power at speed $s_{ji}$ is
$(s_{ji})^{\alpha}$, $i = 1, 2, \ldots, k$, and the execution times at these
speeds are $t_{j1}, t_{j2}, \ldots, t_{jk}$, respectively. So its energy
consumption is $\sum_{i=1}^{k} t_{ji} (s_{ji})^{\alpha}$. We average the $k$
speeds and keep the total execution time unchanged, i.e.,
$\bar{s}_j = (\sum_{i=1}^{k} s_{ji} t_{ji}) / (\sum_{i=1}^{k} t_{ji})$. Because
the power function is a convex function of the speed, according to convexity
[14] (in the rest of the paper, convexity is used in many places without
repeating the reference [14]), we have

$$
\begin{aligned}
\sum_{i=1}^{k} t_{ji} (s_{ji})^{\alpha}
 &= \Big(\sum_{i=1}^{k} t_{ji}\Big) \sum_{i=1}^{k}
    \frac{t_{ji}}{\sum_{i=1}^{k} t_{ji}} (s_{ji})^{\alpha} \\
 &\ge \Big(\sum_{i=1}^{k} t_{ji}\Big)
    \Big(\sum_{i=1}^{k} \frac{t_{ji} s_{ji}}{\sum_{i=1}^{k} t_{ji}}\Big)^{\alpha}
  = \Big(\sum_{i=1}^{k} t_{ji}\Big) (\bar{s}_j)^{\alpha} \\
 &= \sum_{i=1}^{k} t_{ji} (\bar{s}_j)^{\alpha}.
\end{aligned}
$$

So running the task at a unique speed consumes less energy than running it at
different speeds. That is, without changing $J_j$'s execution time or its
assigned processor (so the restriction is still satisfied), we obtain a
schedule with less energy consumption, which contradicts the assumption that S
is an optimal schedule.
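
As a quick numeric illustration of Lemma 1, the following Python sketch compares a task executed at two different speeds with the same task executed at the time-weighted average speed over the same total duration. The function name `energy`, the value of the power parameter and the sample speeds are assumptions made only for this example; they do not come from the paper.

```python
def energy(speeds, durations, alpha):
    """Energy of a piecewise-constant speed profile: sum of t_i * s_i**alpha."""
    return sum(t * s ** alpha for s, t in zip(speeds, durations))


alpha = 3.0                # power parameter (assumed value for the example)
speeds = [1.0, 3.0]        # two speed levels used by one task (hypothetical)
durations = [2.0, 2.0]     # time spent at each speed level (hypothetical)

# Work (CPU cycles) is speed integrated over time; keep work and time fixed.
work = sum(s * t for s, t in zip(speeds, durations))
total_time = sum(durations)
avg_speed = work / total_time

varying = energy(speeds, durations, alpha)           # 2*1**3 + 2*3**3
constant = energy([avg_speed], [total_time], alpha)  # 4*2**3
print(varying, constant)
```

With these numbers the varying profile costs 56 units of energy and the constant profile 32, in line with the convexity argument used in the proof.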

Corollary 1. There exists an optimal solution for SEMRPP in the continuous
model in which each processor executes all of its tasks at a uniform speed and
finishes its tasks at time C.

That all tasks on a processor run at a unique speed can be proved as in
Lemma 1. If some processor finishes its tasks earlier than C, it can lower its
speed to consume less energy without breaking the time constraint or the
restriction. Furthermore, there will be no gaps in the schedule [8].

The above discussion leads to the following reformulation of the SEMRPP
problem in the continuous model:

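$$
\begin{aligned}
(P_1) \quad \min \ & \sum_{i=1}^{m}
   \frac{\big(\sum_{j=1}^{n} x_{ij} w_j\big)^{\alpha}}{C^{\alpha-1}} \\
\text{s.t.} \ & \sum_{j=1}^{n} x_{ij} w_j \le s_{max} C, && \forall P_i, && (1) \\
& \sum_{i=1}^{m} x_{ij} = 1, && \forall J_j, && (2) \\
& x_{ij}(1 - x_{ij}) = 0, && \forall J_j, P_i \in \mathcal{M}_j, && (3) \\
& x_{ij} = 0, && \forall J_j, P_i \notin \mathcal{M}_j. && (4)
\end{aligned}
$$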

The objective function follows from the fact that processor $P_i$ runs at
speed $s_i = \frac{\sum_{j=1}^{n} x_{ij} w_j}{C}$, that is, each task on $P_i$
runs at this speed and $P_i$ completes all of its tasks at time $C$, so $P_i$
consumes energy $(s_i)^{\alpha} C$. (We assume that, in each problem instance,
the computational cycles of the tasks on each processor are large enough that
the processor does not run at a speed $s_i < s_{min}$; otherwise we would turn
off some processors.) Constraint (1) follows since a processor cannot run at a
speed higher than $s_{max}$. Constraint (2) states that a task assigned to one
processor is not assigned to any other processor, i.e., tasks are
non-migratory. Constraints (3) and (4) are the eligibility restrictions of the
tasks on the processors.
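
The objective and constraint (1) of the reformulated problem can be evaluated directly for a given integral assignment. The Python sketch below is illustrative only: the function name `schedule_energy`, its argument layout and the sample instance are assumptions for this example, and the eligibility constraints (3) and (4) are assumed to be satisfied by the caller rather than checked.

```python
def schedule_energy(assignment, work, m, deadline, alpha, s_max):
    """Return (speeds, energy) of an integral assignment.

    assignment[j] is the index of the processor that runs task j and
    work[j] is its number of cycles w_j. Every processor runs at the single
    speed load / deadline, so it finishes exactly at the deadline.
    """
    loads = [0.0] * m
    for j, i in enumerate(assignment):
        loads[i] += work[j]
    speeds = [load / deadline for load in loads]
    # Constraint (1): the load of a processor may not exceed s_max * deadline.
    if any(s > s_max for s in speeds):
        raise ValueError("constraint (1) violated: a processor exceeds s_max")
    # Objective: sum over processors of load**alpha / deadline**(alpha - 1).
    energy = sum(load ** alpha for load in loads) / deadline ** (alpha - 1)
    return speeds, energy


# Hypothetical instance: three tasks, two processors, deadline 2, alpha = 2.
speeds, total = schedule_energy([0, 0, 1], [4.0, 2.0, 6.0], 2, 2.0, 2.0, 10.0)
print(speeds, total)
```

In this instance both processors carry 6 cycles, run at speed 3 and together consume 36 units of energy.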

Lemma 2. Finding an optimal schedule for the SEMRPP problem in the continuous
model is NP-Complete in the strong sense.

Proof. We consider an instance of the SEMRPP problem in which
$\mathcal{M}_j = \mathcal{P}$ for all tasks $J_j$ and $s_{max}$ is fast enough
to ensure a feasible schedule for the given tasks. By the convexity of the
function $f(s) = s^{\alpha}$ ($\alpha > 1$), we note that the optimal schedule
partitions the tasks among the processors as evenly as possible. Then we can
finish the proof by a pseudo-polynomial reduction from the 3-PARTITION problem.

Consider an instance of 3-PARTITION: Given a list
$A = (a_1, a_2, \ldots, a_{3m})$ of $3m$ positive integers such that
$\sum_{j} a_j = mB$ and $\frac{B}{4} < a_j < \frac{B}{2}$ for each
$1 \le j \le 3m$, is there a partition of $A$ into $A_1, A_2, \ldots, A_m$ such
that $\sum_{a_j \in A_i} a_j = B$ for each $1 \le i \le m$? [15, 16]. We
construct an instance of the SEMRPP problem as follows. There are $3m$ tasks
whose execution cycles are equal to the $a_j$, and there are $m$ processors.
The deadline is $C = 1$ and the target energy consumption is $mB^{\alpha}$.
Denote the execution cycles of the processors as $(h_1, h_2, \ldots, h_m)$.
According to $(P_1)$, the energy consumption is
$\sum_{i=1}^{m} (h_i)^{\alpha}$. By convexity, we have
$\sum_{i=1}^{m} (h_i)^{\alpha} = m \sum_{i=1}^{m} \frac{1}{m} (h_i)^{\alpha}
\ge m (\frac{1}{m} \sum_{i=1}^{m} h_i)^{\alpha} = mB^{\alpha}$ (note that
$\sum_{i=1}^{m} h_i = mB$). The energy consumption is equal to $mB^{\alpha}$ if
and only if $h_1 = h_2 = \cdots = h_m = B$. Thus, there is a schedule with
energy consumption $mB^{\alpha}$ if and only if there is a 3-PARTITION. It is
clear that the above reduction is a pseudo-polynomial reduction. So we can
conclude that SEMRPP in the continuous model is strongly NP-Complete by this
pseudo-polynomial time reduction from the 3-PARTITION problem, which has been
proved NP-Complete in the strong sense.

Lemma 3. There exists a polynomial time approximation scheme (PTAS) for the
SEMRPP problem in the continuous model when $\mathcal{M}_j = \mathcal{P}$ and
$s_{max}$ is fast enough.

Proof. The proof is similar to that of [8], whose aim is to give a PTAS for the
problem of minimizing the makespan under an energy budget
($Sm \mid energy \mid C_{max}$). It turns out that the SEMRPP problem is
equivalent to minimizing the $\ell_{\alpha}$ norm of the loads [17], as can be
seen from the proof of Lemma 2 (see $\sum_{i=1}^{m} (h_i)^{\alpha}$, where
$\alpha$ is a constant power parameter). Then we use the PTAS given in [17],
that is, for any $\epsilon > 0$, we can find the sums of the execution cycles
of the tasks on the processors $P_i$ (denoted as loads below)
$L_1, L_2, \ldots, L_m$ in polynomial time such that
$\sum_{i=1}^{m} (L_i)^{\alpha} \le (1 + \epsilon) \sum_{i=1}^{m} (OPT_i)^{\alpha}$,
where $L_i$ is the load of processor $P_i$ in the computed schedule and
$OPT_i$ is the optimal load for processor $P_i$.

Note that we give detailed proofs of Lemma 2 and Lemma 3, which were similarly
stated as observations in [7], and we mainly state the conditions under which
they hold in the restricted environment (such as the set of restricted
processors and the upper speed $s_{max}$, which we discuss later in the paper).

4 Uniform tasks 

We now propose an optimal algorithm, denoted ECSEMRPP_Algo, for the special
case of the SEMRPP problem in which all tasks have equal (uniform) execution
cycles. Note that we can set $w_j = 1$ and set $C = C / w_j$ in $(P_1)$ without
loss of generality. Given the set of tasks $\mathcal{J}$, the set of processors
$\mathcal{P}$ and the sets of eligible processors of the tasks
$\{\mathcal{M}_j\}$, we construct a network $G = (V, E)$ as follows: the vertex
set of $G$ is $V = \mathcal{J} \cup \mathcal{P} \cup \{s, t\}$ ($s$ and $t$
correspond to a source and a destination, respectively), and the edge set $E$
of $G$ consists of three subsets: (1) $(s, P_i)$ for all
$P_i \in \mathcal{P}$; (2) $(P_i, J_j)$ for $P_i \in \mathcal{M}_j$;
(3) $(J_j, t)$ for all $J_j \in \mathcal{J}$. We assign unit capacity to the
edges $(P_i, J_j)$ and $(J_j, t)$, while the edges $(s, P_i)$ have capacity $c$
(initially we can set $c = n$). Define $L^* = \min\{\max\{L_i\}\}$
($i = 1, 2, \ldots, m$), where $L_i$ is the load of processor $P_i$; $L^*$ can
be computed by Algorithm 1.

For a positive number $\alpha \ge 1$, the $\ell_{\alpha}$ norm of a vector
$x = (x_1, x_2, \ldots, x_n)$ is defined by

$\|x\|_{\alpha} = (|x_1|^{\alpha} + |x_2|^{\alpha} + \cdots + |x_n|^{\alpha})^{1/\alpha}$

Algorithm 1: BS_Algo(G, n)
input : (G, n)
output: $L^*$, the processor $P_i$ that has the maximal load, and the set
        $\mathcal{J}_i$ of tasks loaded on $P_i$
1: Let $l = 1$ and $u = n$;
2: If $l = u$, then the optimal value is reached: $L^* = l$; return $P_i$ and
   $\mathcal{J}_i$, stop;
3: Else let $c = \lfloor \frac{1}{2}(l + u) \rfloor$. Find the maximum flow in
   the network G. If the value of the maximum flow is exactly $n$, namely
   $L^* \le c$, then set $u = c$ and keep $P_i$, $\mathcal{J}_i$ as given by
   the maximum flow. Otherwise, the value of the maximum flow is less than $n$,
   namely $L^* > c$, and we set $l = c + 1$. Go back to step 2.



Lemma 4. The algorithm BS_Algo solves the problem of minimizing the maximal
processor load on restricted parallel processors in $O(n^3 \log n)$ time, if
all tasks have equal execution cycles.

The proof follows mainly from the maximum flow algorithm in [18]. The
computational complexity is the time $O(n^3)$ needed to find a maximum flow
multiplied by the $\log n$ steps of the binary search, i.e.,
$O(n^3 \log n)$.
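
The binary search of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: instead of a general maximum flow routine it uses augmenting-path searches for a capacitated bipartite assignment, which computes the same flow value on this particular network; the names `bs_algo`, `assign_with_cap` and `_place` and the input layout (`eligible[j]` is the list of processor indices allowed for unit task `j`, each non-empty) are hypothetical.

```python
def _place(j, eligible, cap, tasks_on, visited):
    """Try to place task j, re-routing already placed tasks if needed.

    This is one augmenting-path search in the residual network.
    """
    for i in eligible[j]:
        if i in visited:
            continue
        visited.add(i)
        if len(tasks_on[i]) < cap:          # free capacity on edge (s, P_i)
            tasks_on[i].append(j)
            return True
        for k in list(tasks_on[i]):         # try to move some task k away
            if _place(k, eligible, cap, tasks_on, visited):
                tasks_on[i].remove(k)
                tasks_on[i].append(j)
                return True
    return False


def assign_with_cap(eligible, m, cap):
    """Return per-processor task lists if every task fits under cap, else None."""
    tasks_on = [[] for _ in range(m)]
    for j in range(len(eligible)):
        if not _place(j, eligible, cap, tasks_on, set()):
            return None                     # flow value is smaller than n
    return tasks_on


def bs_algo(eligible, m):
    """Binary search for L*, the minimum achievable maximal load."""
    lo, hi = 1, len(eligible)
    best = assign_with_cap(eligible, m, hi)  # cap = n is always feasible
    while lo < hi:
        c = (lo + hi) // 2
        found = assign_with_cap(eligible, m, c)
        if found is not None:               # L* <= c
            hi, best = c, found
        else:                               # L* > c
            lo = c + 1
    return lo, best


# Hypothetical instance: 4 unit tasks, 2 processors.
l_star, tasks_on = bs_algo([[0], [0], [0, 1], [1]], 2)
print(l_star, tasks_on)
```

For the sample instance the search settles on a maximal load of 2, with two tasks on each processor.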

We construct our ECSEMRPP_Algo algorithm (Algorithm 2) by finding the min-max
load vector $l$, which is a strongly-optimal assignment as defined in [17, 19].

Definition 1. Given an assignment $H$, denote by $S_k$ the total load on the
$k$ most loaded processors. We say that an assignment is strongly-optimal if
for any other assignment $H'$ (where $S'_k$ correspondingly denotes the total
load on the $k$ most loaded processors) and for all $1 \le k \le m$ we have
$S_k \le S'_k$.
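
Definition 1 can be checked mechanically for two given load vectors. The short Python sketch below is only an illustration; the helper names `prefix_sums_of_largest` and `dominates` and the sample load vectors are assumptions for this example.

```python
from itertools import accumulate


def prefix_sums_of_largest(loads):
    """S_k for k = 1..m: total load on the k most loaded processors."""
    return list(accumulate(sorted(loads, reverse=True)))


def dominates(loads_h, loads_other):
    """True if S_k(H) <= S_k(H') for every k, as Definition 1 requires."""
    pairs = zip(prefix_sums_of_largest(loads_h),
                prefix_sums_of_largest(loads_other))
    return all(a <= b for a, b in pairs)


# Hypothetical loads of two assignments of 4 unit tasks on 3 processors.
balanced = [2, 1, 1]
skewed = [2, 2, 0]
print(dominates(balanced, skewed), dominates(skewed, balanced))
```

Here the prefix sums are (2, 3, 4) for the balanced vector and (2, 4, 4) for the skewed one, so only the balanced vector passes the test against the other; with $\alpha = 2$ its sum of squared loads is 6 versus 8.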



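Algorithm 2: ECSEMRPP_Algo

1: Let $G_0 = G(V, E)$, $\mathcal{P}^{H} = \emptyset$,
   $\mathcal{J}^{H} = \{\phi_i\}$;
2: Call BS_Algo($G_0$, n);
3: Set $i = i + 1$. According to the schedule returned by step 2, we note the
   processor $P_i^{H}$ that has the actual maximal load and note its task set
   $\mathcal{J}_i^{H}$. $E_i^{H}$ corresponds to the edges incident to
   $P_i^{H}$ and $\mathcal{J}_i^{H}$. We set
   $G_0 = \{V \setminus P_i^{H} \setminus \mathcal{J}_i^{H}, E \setminus E_i^{H}\}$,
   $\mathcal{P}^{H} = \mathcal{P}^{H} \cup \{P_i^{H}\}$,
   $\phi_i = \mathcal{J}_i^{H}$. If $G_0 \neq \emptyset$, go to step 2;
4: We assign the tasks of $\mathcal{J}_i^{H}$ to $P_i^{H}$ and run all tasks on
   $P_i^{H}$ at speed $\frac{\sum_{J_j \in \mathcal{J}_i^{H}} w_j}{C}$.
   Return the final schedule H.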



Theorem 1. Algorithm ECSEMRPP_Algo finds the optimal schedule for the SEMRPP
problem in the continuous model in $O(mn^3 \log n)$ time, if all tasks have
equal execution cycles.

Proof. First we prove that the assignment $H$ returned by ECSEMRPP_Algo is a
strongly-optimal assignment. We set $H = \{L_1, L_2, \ldots, L_m\}$, where
$L_i$ corresponds to the load of processor $P_i$ in non-ascending order.
Suppose $H'$ is another assignment with $H' \neq H$, and
$\{L'_1, L'_2, \ldots, L'_m\}$ corresponds to its loads. According to the
ECSEMRPP_Algo algorithm, we know that $H'$ can only be an assignment in which
$P_i$ moves some tasks to $P_j$ ($j < i$), because $P_i$ cannot move tasks to
$P_{j'}$ ($j' > i$); otherwise it could lower $L_i$, which contradicts the
ECSEMRPP_Algo algorithm. We get
$\sum_{i=1}^{k} L_i \le \sum_{i=1}^{k} L'_i$, i.e., $H$ is a strongly-optimal
assignment by the definition. It follows that there does not exist any
assignment that can reduce the difference between the loads of the processors
in the assignment $H$. That is, no other assignment can reduce our objective,
since it is convex. So the optimal schedule is obtained.

Every iteration discards one processor, so the total running time is
$m \times O(n^3 \log n) = O(mn^3 \log n)$ according to Lemma 4, which completes
the proof.

5 General tasks 

As the problem is NP-Complete in the strong sense for general tasks (Lemma 2),
we aim at an approximation algorithm for the SEMRPP problem. First we relax the
equality (3) of $(P_1)$ to

$0 \le x_{ij} \le 1, \quad \forall J_j, P_i \in \mathcal{M}_j$    (5)

After relaxation, the SEMRPP problem becomes a convex program.

The feasibility of the convex program can be checked in polynomial time to
within an additive error of $\epsilon$ (for an arbitrary constant
$\epsilon > 0$) [20], and it can be solved optimally [14]. Let $x^*$ be an
optimal solution to the relaxed SEMRPP problem. Now our goal is to convert this
fractional assignment to an integral one $x$. We adopt the dependent rounding
introduced by [16, 19, 21].

Define a bipartite graph $G(x^*) = (V, E)$ where the vertices of $G$ are
$V = \mathcal{J} \cup \mathcal{P}$ and $e = (i, j) \in E$ if $x^*_{ij} > 0$.
The weight on edge $(i, j)$ is $x^*_{ij} w_j$. The rounding iteratively
modifies $x^*_{ij}$ such that at the end $x^*_{ij}$ becomes integral. There are
mainly two steps, as follows:

i. Break cycle:

1. While ($G(x^*)$ has a cycle $C = (e_1, e_2, \ldots, e_{2l-1}, e_{2l})$)
2. Set $C_1 = (e_1, e_3, \ldots, e_{2l-1})$ and
   $C_2 = (e_2, e_4, \ldots, e_{2l})$. Find the minimal weight edge of $C$,
   denoted as $e_{min}^{C}$, and its weight
   $\epsilon = \min_{e \in C_1 \cup C_2} e$;
3. If $e_{min}^{C} \in C_1$ then subtract $\epsilon$ from every edge in $C_1$
   and add $\epsilon$ to every edge in $C_2$;
4. Else add $\epsilon$ to every edge in $C_1$ and subtract $\epsilon$ from
   every edge in $C_2$;
5. Remove the edges with weight 0 from $G$.

ii. Rounding fractional tasks:

1. In the first rounding phase consider each integral assignment: if
   $x^*_{ij} = 1$, set $x_{ij} = 1$ and discard the corresponding edge from the
   graph. Denote again by $G$ the resulting graph;
2. While ($G(x^*)$ has a connected component $C$)
3. Choose one task node from $C$ as root to construct a tree $Tr$, and match
   each task node with any one of its children. The resulting matching covers
   all task nodes;
4. Match each task to one of its children nodes (a processor) such that
   $P_i = \arg\min_{P_i \in \mathcal{P}} \sum_{x_{ij}=1} x_{ij} w_j$, set
   $x_{ij} = 1$, and set $x_{ij} = 0$ for the other children nodes.
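
A single iteration of the cycle-breaking step (step i) can be sketched in Python as below. The sketch is illustrative only and rests on assumptions: the cycle is given as a plain list of edge weights in cyclic order (the function name `break_cycle` and this representation are hypothetical), and finding the cycle in the graph is left out.

```python
def break_cycle(weights):
    """One cycle-breaking step on an even-length cycle.

    weights[k] is the weight of edge e_{k+1} in cyclic order. Edges at even
    list positions form C1 (e1, e3, ...); the others form C2 (e2, e4, ...).
    The class holding the minimum-weight edge loses eps on every edge and
    the other class gains eps, so at least one edge drops to weight 0.
    """
    if len(weights) % 2 != 0:
        raise ValueError("a cycle in a bipartite graph has even length")
    k_min = min(range(len(weights)), key=lambda k: weights[k])
    eps = weights[k_min]
    sign_c1 = -1 if k_min % 2 == 0 else 1
    return [w + (sign_c1 if k % 2 == 0 else -sign_c1) * eps
            for k, w in enumerate(weights)]


# Hypothetical cycle with four edges; the minimum weight 2 lies in C1.
old = [3, 5, 2, 4]
new = break_cycle(old)
print(new)
```

Every node on the cycle touches one edge of $C_1$ and one edge of $C_2$, so the sum of the two cycle edges at each node is unchanged: neither the total weight of a task nor the load of a processor is altered, while the edge of weight 0 can be removed and the cycle disappears.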

Lemma 5. Relaxation-Dependent rounding finds a $2^{\alpha}$-approximation to
the optimal schedule for the SEMRPP problem in the continuous model in
polynomial time.

Proof. This can be concluded using the results of [19]; we omit the details
here.

Next we improve this result by a more careful analysis for the SEMRPP problem
that generalizes the result of Lemma 5.

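Theorem 2. (i) Relaxation-Dependent rounding finds a
$2^{\alpha-1}(2 - \frac{1}{p^{\alpha}})$-approximation to the optimal schedule
for the SEMRPP problem in the continuous model in polynomial time, where
$p = \max_{\mathcal{M}_j} |\mathcal{M}_j| \le m$. (ii) For any processor $P_i$,
$\sum_{J_j \in \mathcal{J}} x_{ij} w_j < \sum_{J_j \in \mathcal{J}} x^*_{ij} w_j
+ \max_{J_j \in \mathcal{J}: x^*_{ij} \in (0,1)} w_j$, where $x^*_{ij}$ is the
fractional task assignment at the beginning of the second phase (i.e., the
linear constraints on the maximal execution cycles are violated by at most
$\max_{J_j \in \mathcal{J}: x^*_{ij} \in (0,1)} w_j$).

Proof. (i) Denote the optimal solution for the SEMRPP problem as $OPT$, $H^*$
as the fractional schedule obtained after breaking all cycles and $H$ as the
schedule returned by the algorithm. Moreover, denote by $H_1$ the schedule
consisting of the tasks assigned in the first step, i.e., those with
$x^*_{ij} = 1$ right after breaking the cycles, and by $H_2$ the schedule
consisting of the tasks assigned in the second rounding step, i.e., those for
which $x_{ij} = 1$ is set by the matching process. We have
$\|H_1\|_{\alpha} \le \|H^*\|_{\alpha} \le \|OPT\|_{\alpha}$, where the first
inequality follows from the fact that $H_1$ is a sub-schedule of $H^*$ and the
second inequality results from $H^*$ being a fractional optimal schedule
compared with $OPT$, which is an integral schedule. (Here, when the loads of
the $m$ processors in schedule $H_1$ are
$\{l_1^{H_1}, l_2^{H_1}, \ldots, l_m^{H_1}\}$, $\|H_1\|_{\alpha}$ means
$((l_1^{H_1})^{\alpha} + (l_2^{H_1})^{\alpha} + \cdots + (l_m^{H_1})^{\alpha})^{1/\alpha}$.)
We consider $\|H_1\|_{\alpha} \le \|H^*\|_{\alpha}$ carefully. If
$\|H_1\|_{\alpha} = \|H^*\|_{\alpha}$, that is, all tasks have been assigned in
the first step and the second rounding step is not necessary, then we have
$\|H_1\|_{\alpha} = \|H^*\|_{\alpha} = \|OPT\|_{\alpha}$, so the approximation
factor is 1. Next we consider $\|H_1\|_{\alpha} < \|H^*\|_{\alpha}$, so there
are some tasks assigned in the second rounding step; w.l.o.g., denote them as
$\mathcal{J}_1 = \{J_1, \ldots, J_k\}$. We assume the fraction of task $J_j$
assigned on processor $P_i$ is $f_{ij}$ (measured in execution cycles, so that
$\sum_{i=1}^{m} f_{ij} = w_j$) and the largest eligible processor set size is
$p = \max_{\mathcal{M}_j} |\mathcal{M}_j| \le m$. Then we have

$$
\begin{aligned}
(\|H^*\|_{\alpha})^{\alpha}
 &= \sum_{i=1}^{m} \Big(l_i^{H_1} + \sum_{J_j \in \mathcal{J}_1} f_{ij}\Big)^{\alpha} \\
 &\ge \sum_{i=1}^{m} (l_i^{H_1})^{\alpha}
    + \sum_{i=1}^{m} \Big(\sum_{J_j \in \mathcal{J}_1} f_{ij}\Big)^{\alpha} \\
 &= (\|H_1\|_{\alpha})^{\alpha}
    + \sum_{i=1}^{m} \Big(\sum_{J_j \in \mathcal{J}_1} f_{ij}\Big)^{\alpha} \\
 &\ge (\|H_1\|_{\alpha})^{\alpha} + \sum_{i=1}^{m} \sum_{j=1}^{k} (f_{ij})^{\alpha} \\
 &= (\|H_1\|_{\alpha})^{\alpha} + \sum_{j=1}^{k} \sum_{i=1}^{m} (f_{ij})^{\alpha} \\
 &\ge (\|H_1\|_{\alpha})^{\alpha}
    + \sum_{j=1}^{k} \Big(\frac{\sum_{i=1}^{m} f_{ij}}{p}\Big)^{\alpha}
\end{aligned}
\qquad (6)
$$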



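Since $H_2$ schedules only one task per processor, it is an optimal integral
assignment for the subset of tasks it assigns, and it certainly has a cost no
larger than that of any integral assignment for the whole set of tasks. Hence
we have

$$
(\|H_2\|_{\alpha})^{\alpha} = \sum_{j=1}^{k} (w_j)^{\alpha}
  \le (\|OPT\|_{\alpha})^{\alpha} \qquad (7)
$$

So, using $\sum_{i=1}^{m} f_{ij} = w_j$, the inequality (6) can be reduced to

$$
(\|H^*\|_{\alpha})^{\alpha} \ge (\|H_1\|_{\alpha})^{\alpha}
  + \frac{1}{p^{\alpha}} (\|H_2\|_{\alpha})^{\alpha} \qquad (8)
$$

Then, since the loads of $H$ are the sums of the loads of $H_1$ and $H_2$, by
the triangle inequality for the norm and by convexity,

$$
\begin{aligned}
(\|H\|_{\alpha})^{\alpha}
 &\le 2^{\alpha} \Big(\frac{\|H_1\|_{\alpha} + \|H_2\|_{\alpha}}{2}\Big)^{\alpha}
  \le 2^{\alpha-1} \big((\|H_1\|_{\alpha})^{\alpha} + (\|H_2\|_{\alpha})^{\alpha}\big) \\
 &\le 2^{\alpha-1} \Big((\|H^*\|_{\alpha})^{\alpha}
    - \frac{1}{p^{\alpha}} (\|H_2\|_{\alpha})^{\alpha}
    + (\|H_2\|_{\alpha})^{\alpha}\Big) \\
 &\le 2^{\alpha-1} \Big(2 - \frac{1}{p^{\alpha}}\Big) (\|OPT\|_{\alpha})^{\alpha}
\end{aligned}
$$

So

$$
\frac{(\|H\|_{\alpha})^{\alpha}}{(\|OPT\|_{\alpha})^{\alpha}}
  \le 2^{\alpha-1} \Big(2 - \frac{1}{p^{\alpha}}\Big),
$$

which concludes the proof that the schedule $H$ guarantees a
$2^{\alpha-1}(2 - \frac{1}{p^{\alpha}})$-approximation to the optimal solution
for the SEMRPP problem and can be found in polynomial time.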
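
To get a feel for the size of the guarantee, the factor can be tabulated for a few parameter values. The Python sketch below assumes that the factor of Theorem 2(i) reads $2^{\alpha-1}(2 - 1/p^{\alpha})$, with $p$ the largest eligibility set size; the function name `approximation_factor` and the sample parameter values are hypothetical.

```python
def approximation_factor(alpha, p):
    """Bound 2**(alpha - 1) * (2 - 1 / p**alpha), assumed form of Theorem 2(i)."""
    return 2 ** (alpha - 1) * (2 - 1 / p ** alpha)


# Compare with the 2**alpha bound of Lemma 5 for a few sample parameters.
for alpha, p in [(2, 2), (2, 10), (3, 2), (3, 10)]:
    print(alpha, p, approximation_factor(alpha, p), 2 ** alpha)
```

For $\alpha = 2$ and $p = 2$ the factor is 3.5, compared with 4 from Lemma 5; as $p$ grows the factor approaches $2^{\alpha}$ from below.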

(ii) From the above, we also have

$$
\sum_{J_j \in \mathcal{J}} x_{ij} w_j < \sum_{J_j \in \mathcal{J}} x^*_{ij} w_j
  + \max_{J_j \in \mathcal{J}: x^*_{ij} \in (0,1)} w_j, \quad \forall P_i,
$$

where the inequality results from the fact that the load of a processor in
schedule $H$ is its load in $H^*$ plus the weight of the task matched to it.
Because we match each task to one of its child nodes, the execution cycles of
the added task satisfy
$w_j \le \max_{J_j \in \mathcal{J}: x^*_{ij} \in (0,1)} w_j$.

Now we discuss $s_{max}$. First we give Proposition 1 on the relationship
between feasibility and constraint violation.

Proposition 1. If $(P_1)$ has a feasible solution for the SEMRPP problem in the
continuous model, we can hardly solve $(P_1)$ without violating the constraint
on the maximal execution cycles of the processors.

To see this, consider the case where $(P_1)$ has a unique feasible solution,
i.e., the maximal execution cycles of the processors is set to the value of the
$OPT$ solution. If we could always solve $(P_1)$ without violating the
constraint, we could easily devise an exact algorithm for $(P_1)$. But we have
proved that $(P_1)$ is NP-Complete in the strong sense. Next, we give a
guaranteed speed that can be regarded as fast enough for dependent rounding on
restricted parallel processors.

Lemma 6. Dependent rounding obtains the approximate solution without violating
the constraint on the maximal execution cycles of the processors when
$s_{max} C \ge \max_{P_i \in \mathcal{P}} L_i + \max_{J_j \in \mathcal{J}} w_j$,
where $L_i = \sum_{J_j \in \mathcal{J}_i} \frac{1}{|\mathcal{M}_j|} w_j$ and
$\mathcal{J}_i$ is the set of tasks that can be assigned to processor $P_i$.

Proof. First we denote by $H = \{H_1, H_2, \ldots, H_m\}$, in non-ascending
sorted order, the vector of the execution cycles of the $m$ processors at the
beginning of the second step. We also denote by
$L = \{L_1, L_2, \ldots, L_m\}$, in non-ascending sorted order, the vector of
the execution cycles of the $m$ processors with
$L_i = \sum_{J_j \in \mathcal{J}_i} \frac{1}{|\mathcal{M}_j|} w_j$. Now we need
to prove $H_1 \le L_1$. Suppose we have $H_1 > L_1$; w.l.o.g., assume that the
processor $P_1$ has the execution cycles $H_1$. We denote the set of tasks
assigned on $P_1$ as $\mathcal{J}_1^{H}$. Let $\mathcal{M}_1^{H}$ be the set of
processors to which a task currently fractionally or integrally assigned on
processor $P_1$ can be assigned, i.e.,
$\mathcal{M}_1^{H} = \bigcup_{J_j \in \mathcal{J}_1^{H}} \mathcal{M}_j$.
Similarly, we denote the set of tasks that can be processed on
$\mathcal{M}_1^{H}$ as $\mathcal{J}^{H}$ and the set of processors to which the
tasks in $\mathcal{J}^{H}$ can be assigned as
$\mathcal{M}^{H} = \bigcup_{J_j \in \mathcal{J}^{H}} \mathcal{M}_j$. W.l.o.g.,
denote $\mathcal{M}^{H}$ as a set $\{h_1, h_2, \ldots, h_k\}$
($1 \le k \le m$), and a set $\{l_1, l_2, \ldots, l_k\}$ ($1 \le k \le m$) as
its respective processor execution cycles in $L$. According to the convexity of
the objective, we get $H_{h_1} = H_{h_2} = \cdots = H_{h_k}$. By our
assumption, $H_{h_p} > L_{l_q}$, $\forall p, \forall q$. Then

$$
\sum_{p} H_{h_p} > \sum_{q} L_{l_q} \qquad (9)
$$

Note that each integral task (at the beginning of the second step) in the left
part of inequality (9) also has its respective integral task in the right part,
but the right part may have some fractional tasks. So
$\sum_{q} L_{l_q} - \sum_{p} H_{h_p} \ge 0$, i.e.,
$\sum_{p} H_{h_p} \le \sum_{q} L_{l_q}$, a contradiction to inequality (9). The
assumption is wrong, so we have $H_1 \le L_1$. By Theorem 2, the maximal
execution cycles $H_{max}$ of dependent rounding satisfy

$$
\begin{aligned}
H_{max} &< H_1 + \max_{J_j \in \mathcal{J}: x^*_{ij} \in (0,1)} w_j \\
 &\le L_1 + \max_{J_j \in \mathcal{J}} w_j
  = \max_{i} L_i + \max_{J_j \in \mathcal{J}} w_j
\end{aligned}
$$

This finishes the proof.

6 Experimental Results 

In this section, we provide performance detail of experimental results. To demon- 
strate the effectiveness of our approaches, we compare 5 values of interest, the 
optimal fractional solution, the optimal integral solution, the fractional depen- 
dent rounding integral (FDR, in the rest of paper, it refers to the solution of our 
algorithm) solution, the least flexible task (LFJ) solution and the least flexible 
processor (LFM) solution. We use the CPLEX solver [22] to obtain the optimal 
integral solution by solving the relevant Integer Programming. For our approxi- 
mation algorithm, we obtain the optimal fractional solution by CVX solver [23], 
and then apply the dependent rounding by our algorithm. The results of LFJ 
and LFM solutions are obtained by following LFJ and LFM algorithms. 

LFJ ALGORITHM. The tasks are first sorted in non-decreasing order of the cardinality of their processing sets, i.e., by $|\mathcal{M}_j|$. The tasks are then scheduled in this order as a sequential list: each task is assigned to the processor $p_i$ that has the least load among the processors in the task's processing set ($p_i \in \mathcal{M}_j$). Finally, the speed of each processor is set to a value such that the processor finishes its load by the time constraint.

LFM ALGORITHM. The processors are first sorted in non-decreasing order of the cardinality of their sets of eligible tasks. The processors are then scheduled in this order as a sequential list: each processor chooses a task that can be assigned to it and has not yet been assigned to another processor. Finally, the speed of each processor is set to a value such that the processor finishes its load by the time constraint. Note that the main difference between the LFJ and LFM algorithms is whether the tasks select the processors or the processors select the tasks.
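As an illustration only, the LFJ baseline described above can be sketched in Python as follows. The function name `lfj_schedule` and its parameters are hypothetical. The sketch assumes that ties are broken by the lowest index, that each processor runs at the single speed `load / deadline`, and that this speed lies within the allowed speed range; none of these details are specified in the text.

```python
def lfj_schedule(work, eligible, num_procs, deadline, alpha=2.0):
    """Least-flexible-task-first list scheduling (illustrative sketch).

    work[j]     -- execution cycles w_j of task j
    eligible[j] -- set of processor indices task j may run on (M_j)
    deadline    -- common time constraint C
    Assumption: speed bounds [s_min, s_max] are not enforced here.
    """
    load = [0.0] * num_procs
    assignment = {}
    # Tasks with the fewest eligible processors are placed first
    # (sorted() is stable, so ties keep index order).
    for j in sorted(range(len(work)), key=lambda t: len(eligible[t])):
        # Least-loaded processor among those eligible for task j.
        i = min(sorted(eligible[j]), key=lambda p: load[p])
        assignment[j] = i
        load[i] += work[j]
    # Each processor runs at the uniform speed that finishes its load at C.
    speeds = [l / deadline for l in load]
    # Energy = power * time = s^alpha * C per processor.
    energy = sum(s ** alpha * deadline for s in speeds)
    return assignment, speeds, energy
```

For example, with `work = [4, 2, 6]`, `eligible = [{0}, {0, 1}, {0, 1}]`, two processors and `deadline = 2.0`, the task restricted to processor 0 is placed first and the two flexible tasks both end up on processor 1. An LFM sketch would be symmetric, iterating over processors instead of tasks.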

6.1 Simulation Setting 

To evaluate the performance of our algorithm, we create systems consisting of 10 to 50 processors and 50 to 300 tasks. Each task $J_j$ is characterized by two parameters: the amount of execution cycles $w_j$ and the eligibility processing set $\mathcal{M}_j$. $w_j$ is randomly generated in the range [1, 10000]. We simulate two cases for $\mathcal{M}_j$: in one, it is randomly generated from the set $\mathcal{P}$ of processors; in the other, it is arranged to construct the inclusive processing set restriction¹ [9]. Without loss of generality, the power parameter $\alpha$ is set to 2 [2]. The maximal speed $s_{max}$ is set large enough to guarantee a feasible solution. We analyse the effect of three factors: the tightness of the time constraint $C$, the ratio $\eta$ of the number of tasks to the number of processors, and the two different eligibility processing sets. All the results are mean values of different runs on an Intel Core i5-2400 CPU with 3.10 GHz × 4.
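A minimal sketch of how such instances could be generated is given below. The function name `make_instance` is hypothetical. The size distribution of the processing sets and the use of nested prefixes {0, ..., k-1} to obtain inclusive sets are assumptions made for illustration; the text does not state how the sets were actually drawn.

```python
import random


def make_instance(num_procs, num_tasks, inclusive, seed=0):
    """Generate task sizes and eligibility sets (illustrative sketch)."""
    rng = random.Random(seed)
    # Execution cycles w_j drawn uniformly from [1, 10000], as in the text.
    work = [rng.randint(1, 10000) for _ in range(num_tasks)]
    eligible = []
    for _ in range(num_tasks):
        size = rng.randint(1, num_procs)  # assumed: uniform set size
        if inclusive:
            # Prefixes of a fixed processor order are pairwise nested,
            # so any two sets satisfy M_j <= M_k or M_k <= M_j.
            procs = range(size)
        else:
            # Random subset of the processor set.
            procs = rng.sample(range(num_procs), size)
        eligible.append(frozenset(procs))
    return work, eligible
```

Calling `make_instance(10, 50, inclusive=True)` yields 50 tasks on 10 processors whose processing sets are pairwise nested, while `inclusive=False` yields unrelated random subsets.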

6.2 Simulation Results 

Figure 1(a) shows the energy consumption of a system with 10 processors and 27 tasks as the time constraint increases. The five curves correspond to the five values listed at the beginning of this section. Figure 1(b) reports the relative energy consumption of these five values when all of them are normalized by the optimal integral solution. We make the following observations from this simulation.

1) As shown in Figures 1(a) and 1(b), the energy consumption is inversely proportional to the time constraint, and each ratio is almost unaffected by the time constraint. This confirms Lemma 1 and Corollary 1, i.e., each processor executes all tasks assigned to it at a uniform speed. So when the time constraint $C$ grows to $k \times C$, each processor can lower its speed to $\frac{1}{k}$ of its previous value and still finish its tasks. For $\alpha = 2$, the energy consumption is then $(\frac{1}{k})^{\alpha} \times k = \frac{1}{k}$ of the energy consumption under the original time constraint. Thus every kind of energy consumption changes by the same proportion when the time constraint varies, and once the values are normalized by the optimal integral solution, the effect of the time constraint cancels out. This explains Figure 1(b).

2) The optimal fractional values differ little from the optimal integral values; the gap is at most 5% in the experiment. A similar difference is observed between the optimal integral solution and the FDR solution, which is also within 5% in the experiment. This suggests that FDR performs much better than the approximation ratio analysed in Theorem 2.

3) The figure confirms the superiority of the FDR solution, which can be up to 10% better than the LFJ and LFM solutions. After checking the maximum processor load, we find that the result of the fractional dependent rounding is close to the optimal integral solution. This suggests that the FDR solution balances the load among the eligibility processing sets more efficiently.
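The scaling argument in the first observation can be written out as a short derivation. It is a sketch under the assumptions that the task assignment is fixed, that $L_i$ denotes the load (execution cycles) of processor $p_i$, that each processor runs at the uniform speed $L_i / C$, and that this speed stays within $[s_{min}, s_{max}]$; the notation $E(C)$ for the total energy under time constraint $C$ is introduced here for illustration.

```latex
\begin{align*}
E(C) &= \sum_{i} \left(\frac{L_i}{C}\right)^{\alpha} C
      = C^{1-\alpha} \sum_{i} L_i^{\alpha}, \\
\frac{E(kC)}{E(C)} &= \frac{(kC)^{1-\alpha}}{C^{1-\alpha}}
      = k^{1-\alpha}
      = \frac{1}{k} \quad \text{for } \alpha = 2.
\end{align*}
```

Because the factor $k^{1-\alpha}$ does not depend on the assignment, it is the same for every compared solution and drops out of any ratio between two of them.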

Figure 2(a) depicts the normalized energy consumption ratios of the different solutions for varying ratios $\eta$ of the number of tasks to the number of processors. When $\eta$ is small, the differences between the normalized ratios are much larger. This can be explained by the fact that, when $\eta$ is small, a single improperly assigned task makes the energy consumption fluctuate considerably. As $\eta$ increases, the fluctuation shrinks because one improper task assignment has less influence. Figure 2(b) illustrates the normalized energy consumption ratios of a system with 14 processors and 35 tasks for the two eligibility processing sets. As shown in the figure, the type of eligibility processing set influences the performance of the algorithms. The FDR and LFJ solutions perform better in the random processing set case. This can be explained by the fact that in the LFJ and FDR solutions (for FDR, in the last stage, when fractional tasks are rounded to processors) the task chooses its processor, and the random restriction helps the task make a proper choice, although the difference is not very pronounced. By contrast, the LFM solution, in which a processor chooses its tasks, performs much better in the inclusive processing set case. This can be explained as follows: the processor with fewer eligible tasks selects a task first, and if it makes an improper choice, the subsequent processors are not affected much, because they have more tasks to choose from in the inclusive processing set case. It is interesting to observe that the algorithms perform quite differently under the random condition and the regular (inclusive) condition.

¹ Inclusive processing set means that, for the restricted processing sets $\mathcal{M}_j$ and $\mathcal{M}_k$ of any two different tasks, either $\mathcal{M}_j \subseteq \mathcal{M}_k$ or $\mathcal{M}_k \subseteq \mathcal{M}_j$.

Fig. 1. (a) Energy consumption and (b) normalized energy consumption ratio on time constraint.



The optimal fractional solution (solved by CVX), the FDR solution (CVX followed by rounding), the LFJ solution, and the LFM solution are all obtained quickly: in our experiments, each took at most several minutes on average for all the instances presented so far. By contrast, computing the optimal integral solution with CPLEX takes more than one day for large systems, and for still larger systems it runs into trouble in both memory and running time. Note that in all the experiments, the FDR solution is more efficient than the LFJ and LFM solutions. This suggests that our solution assigns tasks more properly in every instance and solves the SEMRPP problem efficiently, combining high quality with low computation time.
Fig. 2. (a) Normalized energy consumption ratio on varying ratios $\eta$ (the optimal integral value is missing at the last point because it cannot be obtained; the other values are normalized by the optimal fractional value) and (b) normalized energy consumption ratio on two eligibility processing sets (0-4 represent each value, respectively).



We emphasize that, according to the latest reports [24, 25], energy costs are on the order of billions of dollars every year. Given this, a reduction of even a few percent in energy cost can result in savings of billions of dollars.

7 Conclusion 

In this paper we explore algorithmic instruments for reducing energy consumption on restricted parallel processors. We aim at minimizing the total energy consumption, where the speed scaling method is used to reduce energy consumption under the execution time constraint ($C_{max}$). We first assess the complexity of the scheduling problem under the speed scaling and restricted parallel processors settings. We present a polynomial-time approximation algorithm with a $2^{\alpha-1}(2 - \frac{1}{p^{\alpha}})$-approximation factor ($p = \max_{\mathcal{M}_j} |\mathcal{M}_j| \le m$) for the general case in which the tasks have arbitrary sizes of execution cycles. In particular, when the tasks have a uniform size, we propose an optimal scheduling algorithm with time complexity $O(mn^3 \log n)$. We evaluate the performance of our algorithm by a set of simulated experiments. It turns out that our solution is very close to the optimal solution. This confirms that our algorithm can provide efficient scheduling for the SEMRPP problem.